UNC HiSeq mRNAseq gene expression (RSEM)

The goal of this notebook is to introduce you to the mRNAseq gene expression BigQuery table.

This table contains all available TCGA Level-3 gene expression data produced by UNC's RNAseqV2 pipeline using the Illumina HiSeq platform, as of July 2016. The most recent archive (eg unc.edu_BRCA.IlluminaHiSeq_RNASeqV2.Level_3.1.11.0) for each of the 33 tumor types was downloaded from the DCC, and data extracted from all files matching the pattern %.rsem.genes.normalized_results. Each of these raw “RSEM genes normalized results” files has two columns: gene_id and normalized_count. The gene_id string contains two parts: the gene symbol, and the Entrez gene ID, separated by | eg: TP53|7157. During ETL, the gene_id string is split and the gene symbol is stored in the original_gene_symbol field, and the Entrez gene ID is stored in the gene_id field. In addition, the Entrez ID is used to look up the current HGNC approved gene symbol, which is stored in the HGNC_gene_sybmol field.

In order to work with BigQuery, you need to import the python bigquery module (gcp.bigquery) and you need to know the name(s) of the table(s) you are going to be working with:


In [1]:
import gcp.bigquery as bq
mRNAseq_BQtable = bq.Table('isb-cgc:tcga_201607_beta.mRNA_UNC_HiSeq_RSEM')

From now on, we will refer to this table using this variable ($mRNAseq_BQtable), but we could just as well explicitly give the table name each time.

Let's start by taking a look at the table schema:


In [2]:
%bigquery schema --table $mRNAseq_BQtable


Out[2]:

Now let's count up the number of unique patients, samples and aliquots mentioned in this table. We will do this by defining a very simple parameterized query. (Note that when using a variable for the table name in the FROM clause, you should not also use the square brackets that you usually would if you were specifying the table name as a string.)


In [3]:
%%sql --module count_unique

DEFINE QUERY q1
SELECT COUNT (DISTINCT $f, 25000) AS n
FROM $t

In [4]:
fieldList = ['ParticipantBarcode', 'SampleBarcode', 'AliquotBarcode']
for aField in fieldList:
  field = mRNAseq_BQtable.schema[aField]
  rdf = bq.Query(count_unique.q1,t=mRNAseq_BQtable,f=field).results().to_dataframe()
  print " There are %6d unique values in the field %s. " % ( rdf.iloc[0]['n'], aField)


 There are   9530 unique values in the field ParticipantBarcode. 
 There are  10289 unique values in the field SampleBarcode. 
 There are  10291 unique values in the field AliquotBarcode. 

We can do the same thing to look at how many unique gene symbols and gene ids exist in the table:


In [5]:
fieldList = ['original_gene_symbol', 'HGNC_gene_symbol', 'gene_id']
for aField in fieldList:
  field = mRNAseq_BQtable.schema[aField]
  rdf = bq.Query(count_unique.q1,t=mRNAseq_BQtable,f=field).results().to_dataframe()
  print " There are %6d unique values in the field %s. " % ( rdf.iloc[0]['n'], aField)


 There are  20501 unique values in the field original_gene_symbol. 
 There are  20182 unique values in the field HGNC_gene_symbol. 
 There are  20531 unique values in the field gene_id. 

Based on the counts, we can see that there are a few instances where the original gene symbol (from the underlying TCGA data file), or the HGNC gene symbol or the gene id (also from the original TCGA data file) is missing, but for the majority of genes, all three values should be available and for the most part the original gene symbol and the HGNC gene symbol that was added during ETL should all match up. This next query will generate the complete list of genes for which none of the identifiers are null, and where the original gene symbol and the HGNC gene symbol match. This list has over 18000 genes in it.


In [6]:
%%sql

SELECT
  HGNC_gene_symbol,
  original_gene_symbol,
  gene_id
FROM
  $mRNAseq_BQtable
WHERE
  ( original_gene_symbol IS NOT NULL
    AND HGNC_gene_symbol IS NOT NULL
    AND original_gene_symbol=HGNC_gene_symbol
    AND gene_id IS NOT NULL )
GROUP BY
  original_gene_symbol,
  HGNC_gene_symbol,
  gene_id
ORDER BY
  HGNC_gene_symbol


Out[6]:
HGNC_gene_symboloriginal_gene_symbolgene_id
A1BGA1BG1
A1CFA1CF29974
A2MA2M2
A2ML1A2ML1144568
A4GALTA4GALT53947
A4GNTA4GNT51146
AAASAAAS8086
AACSAACS65985
AADACAADAC13
AADACL2AADACL2344752
AADACL3AADACL3126767
AADACL4AADACL4343066
AADATAADAT51166
AAGABAAGAB79719
AAK1AAK122848
AAMPAAMP14
AANATAANAT15
AARSAARS16
AARS2AARS257505
AARSD1AARSD180755
AASDHAASDH132949
AASDHPPTAASDHPPT60496
AASSAASS10157
AATFAATF26574
AATKAATK9625

(rows: 18153, time: 25.5s, 4GB processed, job: job_q9uLYjnsTKOT4eFAyWM1uu2VVPw)

We might also want to know how often the gene symbols do not agree:


In [7]:
%%sql

SELECT
  HGNC_gene_symbol,
  original_gene_symbol,
  gene_id
FROM
  $mRNAseq_BQtable
WHERE
  ( original_gene_symbol IS NOT NULL
    AND HGNC_gene_symbol IS NOT NULL
    AND original_gene_symbol!=HGNC_gene_symbol
    AND gene_id IS NOT NULL )
GROUP BY
  original_gene_symbol,
  HGNC_gene_symbol,
  gene_id
ORDER BY
  HGNC_gene_symbol


Out[7]:
HGNC_gene_symboloriginal_gene_symbolgene_id
A1BG-AS1NCRNA00181503538
A2M-AS1LOC144571144571
AACSP1AACSL729522
AADACP1LOC201651201651
AAED1C9orf21195827
AAMDCC11orf6728971
AAR2C20orf425980
AARDC8orf85441376
AATBCLOC284837284837
AATK-AS1LOC388428388428
ABHD11-AS1WBSCR26171022
ABHD16ABAT57920
ABHD16BC20orf135140701
ABHD17AFAM108A181926
ABHD17BFAM108B151104
ABHD17CFAM108C158489
ABHD18C4orf2980167
ABRACLC6orf11558527
ACKR1DARC2532
ACKR2CCBP21238
ACKR3CXCR757007
ACKR4CCRL151554
ACSM6C10orf129142827
ACTG1P4LOC648740648740
ACTL10C20orf134170487

(rows: 2013, time: 6.1s, 4GB processed, job: job_Qogec9zEdrWGOVbaaDX_G3EioLM)

BigQuery is not just a "look-up" service -- you can also use it to perform calculations. In this next query, we take a look at the mean, standard deviation, and coefficient of variation for the expression of EGFR, within each tumor-type, as well as the number of primary tumor samples that went into each summary statistic.


In [8]:
%%sql

SELECT
  Study,
  n,
  exp_mean,
  exp_sigma,
  (exp_sigma/exp_mean) AS exp_cv
FROM (
  SELECT
    Study,
    AVG(LOG2(normalized_count+1)) AS exp_mean,
    STDDEV_POP(LOG2(normalized_count+1)) AS exp_sigma,
    COUNT(AliquotBarcode) AS n
  FROM
    $mRNAseq_BQtable
  WHERE
    ( SampleTypeLetterCode="TP"
      AND HGNC_gene_symbol="EGFR" )
  GROUP BY
    Study )
ORDER BY
  exp_sigma DESC


Out[8]:
Studynexp_meanexp_sigmaexp_cv
GBM15611.89779351212.378987104070.199951957617
SKCM1045.411892972212.31341436020.427468608873
BRCA10957.33165237752.01860601240.275327567165
PCPG1795.213633872611.935119116520.371165134301
BLCA4089.262588691391.840034910220.19865233916
CESC3049.6153635781.746709705150.181658206783
THYM1208.614080544061.743401254170.202389708948
LGG51611.14719949471.722223088030.154498274553
ESCA18411.67000634611.639538984390.140491696042
SARC2598.586415542961.591047678860.185298238933
UCEC1767.444763264811.58964457860.213525201818
LUSC50210.92946547261.549718098430.141792670677
TGCT1506.983367114491.507638818780.215889955958
KICH669.480020471411.498462839830.158065359073
LUAD5159.895113742031.498256912490.151413814086
HNSC52011.39075895421.488800969290.130702526081
DLBC484.419361662061.474055288880.333544842355
CHOL369.196454544651.461856577080.158958712836
UCS578.098531271171.448583027580.178869844306
ACC797.74425592311.443676932850.186419063004
MESO879.551977523481.433699346750.1500945059
STAD41510.21986085331.393684436530.136370196869
LIHC3719.605808172961.37192898820.142822859201
OV3058.538293024191.198151557960.140326825814
KIRP2909.361405341421.195130234020.127665685913

(rows: 32, time: 2.3s, 11GB processed, job: job_uEvr7cSfInKOsx2ykx_7KH24Uuo)

We can also easily move the gene-symbol out of the WHERE clause and into the SELECT and GROUP BY clauses and have BigQuery do this same calculation over all genes and all tumor types. This time we will use the --module option to define the query and then call it in the next cell from python.


In [9]:
%%sql --module highVar

SELECT
  Study,
  HGNC_gene_symbol,
  n,
  exp_mean,
  exp_sigma,
  (exp_sigma/exp_mean) AS exp_cv
FROM (
  SELECT
    Study,
    HGNC_gene_symbol,
    AVG(LOG2(normalized_count+1)) AS exp_mean,
    STDDEV_POP(LOG2(normalized_count+1)) AS exp_sigma,
    COUNT(AliquotBarcode) AS n
  FROM
    $t
  WHERE
    ( SampleTypeLetterCode="TP" )
  GROUP BY
    Study,
    HGNC_gene_symbol )
ORDER BY
  exp_sigma DESC

Once we have defined a query, we can put it into a python object and print out the SQL statement to make sure it looks as expected:


In [10]:
q = bq.Query(highVar,t=mRNAseq_BQtable)
print q.sql


SELECT
  Study,
  HGNC_gene_symbol,
  n,
  exp_mean,
  exp_sigma,
  (exp_sigma/exp_mean) AS exp_cv
FROM (
  SELECT
    Study,
    HGNC_gene_symbol,
    AVG(LOG2(normalized_count+1)) AS exp_mean,
    STDDEV_POP(LOG2(normalized_count+1)) AS exp_sigma,
    COUNT(AliquotBarcode) AS n
  FROM
    [isb-cgc:tcga_201607_beta.mRNA_UNC_HiSeq_RSEM]
  WHERE
    ( SampleTypeLetterCode="TP" )
  GROUP BY
    Study,
    HGNC_gene_symbol )
ORDER BY
  exp_sigma DESC

And then we can run it and save the results in another python object:


In [11]:
r = bq.Query(highVar,t=mRNAseq_BQtable).results()

In [12]:
#r.to_dataframe()

Since the result of the previous query is quite large (over 600,000 rows representing ~20,000 genes x ~30 tumor types), we might want to put those results into one or more subsequent queries that further refine these results, for example:


In [13]:
%%sql --module hv_genes

SELECT *
FROM ( $hv_result )
HAVING
  ( exp_mean > 6.
    AND n >= 200
    AND exp_cv > 0.5 )
ORDER BY
  exp_cv DESC

In [14]:
bq.Query(hv_genes,hv_result=r).results().to_dataframe()


Out[14]:
Study HGNC_gene_symbol n exp_mean exp_sigma exp_cv
0 SARC RPS4Y1 259 6.061873 5.548601 0.915328
1 COAD XIST 285 6.248359 5.470363 0.875488
2 COAD RPS4Y1 285 6.121045 5.155333 0.842231
3 BLCA GSTM1 408 6.010392 5.006242 0.832931
4 KIRC XIST 533 6.407005 5.271564 0.822781
5 BRCA CPB1 1095 6.559260 5.168064 0.787904
6 LGG RPS4Y1 516 6.704050 5.271334 0.786291
7 LGG XIST 516 7.173960 5.539642 0.772187
8 HNSC ACTC1 520 6.066815 4.653893 0.767106
9 BLCA KRT6C 408 6.000740 4.567136 0.761095
10 LIHC GSTM1 371 6.074415 4.605980 0.758259
11 SARC PLA2G2A 259 6.002323 4.550597 0.758139
12 STAD REG3A 415 6.314804 4.661015 0.738109
13 LUSC MAGEA6 502 6.056004 4.452028 0.735143
14 SARC XIST 259 7.028725 5.046253 0.717947
15 STAD FABP1 415 6.015140 4.304719 0.715647
16 KIRC FGB 533 6.385475 4.557211 0.713684
17 HNSC ACTA1 520 6.272165 4.441155 0.708074
18 BLCA KRT6B 408 6.821403 4.818030 0.706311
19 STAD RPS4Y1 415 6.695497 4.713168 0.703931
20 CESC SPRR2E 304 6.072856 4.266463 0.702546
21 LIHC FXYD2 371 6.026990 4.227431 0.701417
22 BLCA CLCA4 408 6.096983 4.248395 0.696803
23 LIHC EPCAM 371 6.018766 4.176006 0.693831
24 BLCA KRT4 408 6.182480 4.287769 0.693536
25 LIHC CYP1A2 371 6.324730 4.386093 0.693483
26 LUSC MAGEA9B 502 6.100760 4.227037 0.692871
27 LUAD XIST 515 7.728633 5.342959 0.691320
28 KIRC KDM5D 533 6.093867 4.209498 0.690776
29 SARC CHRDL2 259 6.360666 4.375406 0.687885
... ... ... ... ... ... ...
296 STAD TM4SF4 415 6.620646 3.355416 0.506811
297 LIHC HGFAC 371 8.064111 4.084009 0.506443
298 STAD CDX1 415 7.238605 3.665532 0.506387
299 KIRP HBA1 290 6.039078 3.055942 0.506028
300 PRAD CST1 497 6.162322 3.117676 0.505926
301 OV GAL3ST3 305 6.111476 3.089889 0.505588
302 COAD ALDH1L1 285 6.285763 3.170998 0.504473
303 BRCA KLHDC7A 1095 6.339753 3.197413 0.504343
304 LUSC SBSN 502 6.835398 3.444914 0.503982
305 BRCA KRT81 1095 6.511374 3.280551 0.503819
306 SARC TPSAB1 259 6.814440 3.433117 0.503800
307 KIRP ALDOB 290 6.118838 3.080089 0.503378
308 SARC SERPINA3 259 6.266452 3.154244 0.503354
309 BRCA ELF5 1095 6.514652 3.279154 0.503351
310 LIHC SCGN 371 6.090981 3.063063 0.502885
311 BLCA CYP4F22 408 6.981163 3.508778 0.502606
312 PRAD ORM2 497 6.222926 3.123566 0.501945
313 CESC MMP10 304 6.789319 3.405935 0.501661
314 LUAD B3GNT6 515 6.068778 3.042877 0.501399
315 STAD CYP2B6 415 6.206014 3.110105 0.501144
316 SARC SMOC1 259 7.165206 3.589628 0.500980
317 SARC HR 259 6.015107 3.013293 0.500954
318 SARC ABCA8 259 6.951896 3.482219 0.500902
319 CESC SPRR1B 304 9.031974 4.524126 0.500901
320 KIRC SLC5A8 533 7.190818 3.598825 0.500475
321 OV ZNHIT2 305 6.444236 3.224612 0.500387
322 BLCA DES 408 8.655334 4.330614 0.500341
323 LIHC PLPP2 371 6.643719 3.323184 0.500199
324 THCA DCSTAMP 505 8.456327 4.228831 0.500079
325 BLCA CEACAM6 408 7.919889 3.960153 0.500026

326 rows × 6 columns


In [ ]: